Abstract. Efficient ontology debugging is a cornerstone for many activities in
the context of the Semantic Web, especially when automatic tools produce (parts 
of) ontologies such as in the field of ontology matching. The best currently known 
interactive debugging systems rely upon some meta information in terms of fault 
probabilities, which can speed up the debugging procedure in the good case, but 
can also have negative impact on the performance in the bad case. The problem 
is that assessment of the meta information is only possible a-posteriori. Conse- 
quently, as long as the actual fault is unknown, there is always some risk of sub- 
optimal interactive diagnoses discrimination. As an alternative, one might prefer 
to rely on a tool which pursues a no-risk strategy. In this case, however, possi- 
bly well-chosen meta information cannot be exploited, resulting again in ineffi- 
cient debugging actions. In this work we present a reinforcement learning strategy 
that continuously adapts its behavior depending on the performance achieved and 
minimizes the risk of using low-quality meta information. Therefore, this method 
is suitable for application scenarios where reliable a-priori fault estimates are 
difficult to obtain. Using problematic ontologies in the field of ontology match- 
ing, we show that the proposed risk-aware query strategy outperforms both active 
learning approaches and no-risk strategies on average in terms of required amount 
of user interaction. 



1 Introduction 

The foundation for widespread adoption of Semantic Web technologies is a broad com- 
munity of ontology developers which is not restricted to experienced knowledge engi- 
neers. Instead, domain experts from diverse fields should be able to create ontologies 
incorporating their knowledge as autonomously as possible. The resulting ontologies 
are required to fulfill some minimal quality criteria, usually consistency, coherency and 
no undesired entailments, in order to grant successful deployment. However, the correct 
formulation of logical descriptions in ontologies is an error-prone task which accounts 
for a need for assistance in ontology development in terms of ontology debugging tools. 
Usually, such tools [14, 7, 2, 4] use model-based diagnosis [13] to identify sets of faulty
axioms, called diagnoses, that need to be modified or deleted in order to meet the imposed
quality requirements. The major challenge inherent in the debugging task is often
a substantial number of alternative diagnoses. This problem has been addressed in [15]
by proposing a debugging method based on active learning which exploits additional 
information in terms of queries to a user about the intended ontology. Thereby, the se- 
lection of queries is guided by the specification of some meta information, i.e. prior 
knowledge about fault probabilities of a user w.r.t. particular logical operators. When 



chosen appropriately, this meta information proved to be very useful in that the interac- 
tion with the domain expert can be drastically reduced. However, given that only poor 
prior knowledge is available, the amount of interaction increased compared to methods 
which manifest constant performance without taking into account any meta informa- 
tion. 

A similar interactive technique can be found in [11], where queries to a user are
incorporated to revise an ontology. Ontology revision aims at partitioning a given on- 
tology into a set of correct axioms and a set of incorrect ones. The system can deal with 
inconsistent/incoherent ontologies only after a union of all axioms causing these prob- 
lems is identified and added to the initial set of incorrect axioms. Computation of these 
axioms, however, requires ontology debugging, which is not addressed in the paper. 

In a debugging scenario involving a faulty ontology developed by one expert, the 
meta information might be extracted from the logs of previous sessions, if available, or 
specified by the expert based on their experience w.r.t. own faults. However, in scenarios 
involving automatized systems producing (parts of) ontologies, e.g. ontology alignment 
and ontology learning, or numerous users collaborating in modeling an ontology, the 
choice of reasonable meta information is rather unclear. If, on the one hand, an active 
learning method is used relying on a guess of the meta information, this might result 
in an overhead w.r.t. user interaction of more than 2000%. If one wants to play it safe, 
on the other hand, by deciding not to exploit any meta information at all, this might 
also result in substantial extra time and effort for the user. So far, one has had to
choose between strategies with high potential but also high risk, and methods with no risk
but also no potential.

In this work we present an ontology debugging approach with high potential and
low risk, which minimizes user interaction throughout a debugging session
on average, without depending on high-quality meta information. By virtue of its
reinforcement learning capability, our approach is optimally suited for debugging
ontologies where only vague or no meta information is available. On the one hand, our
method takes advantage of the given meta information as long as good performance is 
achieved. On the other hand, it gradually gets more independent of meta information if 
suboptimal behavior is measured. Moreover, our strategy can take into account an ex- 
pert's subjective quality estimation of the meta information. In this way an expert may 
decide to take influence on the algorithm's behavior by limiting the range of admissi- 
ble values the learning parameter may take. Alternatively, the algorithm acts freely and 
finds a profitable strategy on its own. This is accomplished by constantly improving the 
quality of meta information and adapting a risk parameter based on the new informa- 
tion obtained by queries answered by the user. This means that, in case of good meta 
information, the performance of our method will be close to the performance of the 
active learning method, whereas, in case of bad meta information, the achieved perfor- 
mance will approach the performance of the risk-free strategy. So, our approach can be 
seen as a risk optimization strategy (RIO) which combines the benefits of active learn- 
ing and risk-free strategies. Experiments on two datasets of faulty ontologies show the 
feasibility, efficiency and scalability of RIO. The evaluation of these experiments will 
manifest that, on average, RIO is the best choice of strategy for both good and bad meta 
information with savings in terms of user interaction of up to 80%. 



The problem specification, basic concepts and a motivating example are provided in
Section 2. Section 3 explains the suggested approach and gives implementation details.
Evaluation results are described in Section 4. Section 5 concludes.



2 Basic Concepts and Motivation 

Ontology debugging deals with the following problem: Given is an ontology O which 
does not meet postulated requirements R, e.g. R = {coherency, consistency}. O is a 
set of axioms formulated in some monotonic knowledge representation language, e.g. 
OWL. The task is to find a subset of axioms in O, called diagnosis, that needs to be 
altered or eliminated from the ontology in order to meet the given requirements. To this 
end, our approach to ontology debugging presumes sound and complete procedures for 
deciding logical consistency and for calculating logical entailments, which are used as a 
black box. For OWL, e.g., both functionalities are provided by a standard DL-reasoner. 

Generally, there are many diagnoses for one and the same faulty ontology $O$. The
problem is then to figure out the single diagnosis, called target diagnosis $\mathcal{D}_t$, that
complies with the knowledge to be modeled by the intended ontology. In interactive
ontology debugging we assume a user, e.g. the author of the faulty ontology or a domain
expert, interacting with an ontology debugging system by answering queries about
entailments of the desired ontology, called the target ontology $O_t$. The target ontology
can be understood as $O$ minus the axioms of $\mathcal{D}_t$ plus additional axioms $EX_{\mathcal{D}_t}$ which
can be added in order to regain desired entailments which might have been eliminated
together with axioms in $\mathcal{D}_t$. Note that the user is not expected to know $O_t$ explicitly
(in which case there would be no need to consult an ontology debugger), but implicitly
in that they are able to answer queries about $O_t$. Roughly speaking, each query is
a set of logical descriptions and the user is asked whether the conjunction of these
descriptions is entailed by $O_t$. Every positively (negatively) answered query constitutes
a positive (negative) test case fulfilled by $O_t$. The set of positive (entailed) and negative
(non-entailed) test cases is denoted by $P$ and $N$, respectively. So, $P$ and $N$ are sets of
sets of axioms, which can be, but do not need to be, initially empty. Test cases can be
seen as constraints $O_t$ must satisfy and are therefore used to gradually reduce the search
space for valid diagnoses. Simply put, the overall procedure consists of (1) computing
a predefined number of diagnoses, (2) gathering additional information by querying the
user, (3) incorporating this information to cut irrelevant areas off the search space, and
so forth, until the search space is reduced to a single (target) diagnosis $\mathcal{D}_t$.

The general debugging setting we consider also envisions the opportunity for the 
user to specify some background knowledge B, i.e. a set of axioms which are known to 
be correct. B is then incorporated in the calculations throughout the ontology debugging 
procedure. For example, in case the user knows that a subset of axioms in O is definitely 
sound, all axioms in this subset are added to B before initiating the debugging session. 
Then, B and O \ B partition the original ontology into a set of correct and possibly 
incorrect axioms, respectively. In the debugging session, only $O := O \setminus B$ is used to
search for diagnoses. This can reduce the search space for diagnoses substantially. 

More formally, ontology debugging can be defined in terms of conditions a target 
ontology must fulfill, which leads to the definition of a diagnosis problem instance, for 
which we search for solutions, i.e. diagnoses: 



Definition 1 (Target Ontology, Diagnosis Problem Instance). Let $O = (\mathcal{T}, \mathcal{A})$ denote
an ontology consisting of a set of terminological axioms $\mathcal{T}$ and a set of assertional
axioms $\mathcal{A}$, $P$ a set of positive test cases, $N$ a set of negative test cases, $B$ a set of
background knowledge axioms, and $R$ a set of requirements to an ontology¹. Then an
ontology $O_t$ is called target ontology iff all the following conditions are fulfilled:

$\forall r \in R: O_t \cup B$ fulfills $r$
$\forall p \in P: O_t \cup B \models p$
$\forall n \in N: O_t \cup B \not\models n$

The tuple $\langle O, B, P, N \rangle_R$ is called a diagnosis problem instance iff $B \cup (\bigcup_{p \in P} p) \not\models n$
for all $n \in N$ and $O$ is not a target ontology, i.e. $O$ violates at least one of the conditions
above.

Definition 2 (Diagnosis). We call $\mathcal{D} \subseteq O$ a diagnosis w.r.t. a diagnosis problem
instance $\langle O, B, P, N \rangle_R$ iff there exists a set of axioms $EX_{\mathcal{D}}$ such that $(O \setminus \mathcal{D}) \cup EX_{\mathcal{D}}$
is a target ontology. A diagnosis $\mathcal{D}$ is minimal iff there is no $\mathcal{D}' \subset \mathcal{D}$ such that $\mathcal{D}'$ is
a diagnosis. A diagnosis $\mathcal{D}$ gives complete information about the correctness of each
axiom $ax_k \in O$, i.e. all $ax_i \in \mathcal{D}$ are assumed to be faulty and all $ax_j \in O \setminus \mathcal{D}$ are
assumed to be correct. The set of all minimal diagnoses is denoted by $\mathbf{D}$.

The identification of an extension $EX_{\mathcal{D}}$, accomplished e.g. by some learning approach,
is a crucial part of the ontology repair process. However, the formulation of a complete
extension is outside the scope of this work where we focus on computing diagnoses.
Following the approach suggested in [15], we approximate $EX_{\mathcal{D}}$ by the set $\bigcup_{p \in P} p$.

An immediate consequence of Definition 2 is: The more test cases are specified, the
fewer minimal diagnoses $\mathbf{D}$ exist for a diagnosis problem instance. So, the uncertainty
about the target diagnosis $\mathcal{D}_t \in \mathbf{D}$ is gradually reduced by specifying test cases.
Example: Consider the OWL ontology O encompassing the following terminology T: 

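$ax_1: \mathit{PhD} \sqsubseteq \mathit{Researcher}$
$ax_2: \mathit{Researcher} \sqsubseteq \mathit{DeptEmployee}$
$ax_3: \mathit{PhDStudent} \sqsubseteq \mathit{Student}$
$ax_4: \mathit{Student} \sqsubseteq \neg \mathit{DeptMember}$
$ax_5: \mathit{PhDStudent} \sqsubseteq \mathit{PhD}$
$ax_6: \mathit{DeptEmployee} \sqsubseteq \mathit{DeptMember}$

and an assertional axiom $\mathcal{A} = \{\mathit{PhDStudent}(s)\}$. Then $O$ is inconsistent since it
describes the PhD student $s$ as both a department member and not a department member.

Let us assume that the assertion $\mathit{PhDStudent}(s)$ is considered as correct and is
thus added to the background theory, i.e. $B = \mathcal{A}$, and both sets $P$ and $N$ are empty.
Then, the set of minimal diagnoses is
$\mathbf{D} = \{\mathcal{D}_1: [ax_1], \mathcal{D}_2: [ax_2], \mathcal{D}_3: [ax_3], \mathcal{D}_4: [ax_4], \mathcal{D}_5: [ax_5], \mathcal{D}_6: [ax_6]\}$
for the given problem instance $\langle \mathcal{T}, \mathcal{A}, \emptyset, \emptyset \rangle$. $\mathbf{D}$ can be
computed by a diagnosis algorithm such as the one presented in [2].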
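The six minimal diagnoses of the example can be checked with a small brute-force sketch. It is only an illustration under strong assumptions: the "reasoner" below is a hypothetical forward-chaining toy that handles atomic subsumptions and a single individual s, not a DL reasoner, and the exhaustive subset enumeration stands in for a real diagnosis algorithm.

```python
from itertools import combinations

# Toy encoding of the example terminology: axiom id -> (subclass, superclass).
# "not X" encodes the complement of class X (illustrative convention only).
AXIOMS = {
    1: ("PhD", "Researcher"),
    2: ("Researcher", "DeptEmployee"),
    3: ("PhDStudent", "Student"),
    4: ("Student", "not DeptMember"),
    5: ("PhDStudent", "PhD"),
    6: ("DeptEmployee", "DeptMember"),
}
BACKGROUND = {"PhDStudent"}  # B = {PhDStudent(s)}


def closure(axiom_ids):
    """Classes derivable for the individual s from the given axioms and B."""
    derived = set(BACKGROUND)
    changed = True
    while changed:
        changed = False
        for i in axiom_ids:
            sub, sup = AXIOMS[i]
            if sub in derived and sup not in derived:
                derived.add(sup)
                changed = True
    return derived


def consistent(axiom_ids):
    """Inconsistent iff s belongs to some class and to its complement."""
    derived = closure(axiom_ids)
    return not any(("not " + c) in derived for c in derived)


def minimal_diagnoses():
    """Enumerate subsets by increasing size; keep consistent-making, minimal ones."""
    all_ids = sorted(AXIOMS)
    found = []
    for size in range(len(all_ids) + 1):
        for candidate in combinations(all_ids, size):
            candidate = set(candidate)
            if any(d <= candidate for d in found):
                continue  # a subset is already a diagnosis, so not minimal
            if consistent(set(all_ids) - candidate):
                found.append(candidate)
    return found


print(minimal_diagnoses())
```

Because all six axioms take part in the single conflict, removing any one of them restores consistency, which is why every singleton is a minimal diagnosis in this sketch.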

1 Throughout the paper we consider debugging of inconsistent and/or incoherent ontologies, i.e.
whenever not stated explicitly we assume $R = \{consistency, coherency\}$.

With six diagnoses for six ontology axioms, this example might already give an
idea that in many cases the number of diagnoses in $\mathbf{D}$ can get very large. Without any prior
knowledge, each of the diagnoses in $\mathbf{D}$ is equally likely to be the target diagnosis $\mathcal{D}_t$. So,
it depends on the specified test cases, i.e. answers to the queries asked to the user, which
diagnosis will be the target diagnosis. The test cases, however, represent properties,
i.e. entailments and non-entailments, of the target ontology $O_t := (O \setminus \mathcal{D}_t) \cup EX_{\mathcal{D}_t}$
and thus make it possible to constrain the possibilities for $\mathcal{D}_t$. In order to define a query [15],
the fact is exploited that the ontologies $O \setminus \mathcal{D}_i$ and $O \setminus \mathcal{D}_j$ resulting from the application of
different diagnoses $\mathcal{D}_i, \mathcal{D}_j \in \mathbf{D}$ ($\mathcal{D}_i \neq \mathcal{D}_j$) entail different sets of logical descriptions.
When we speak of entailments, we address the output computed by the classification
and realization services of a reasoner. Formally, a query is defined as follows:

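Definition 3 (Query). A set of logical descriptions $X_j$ is called a query iff there
exists a set of diagnoses $\emptyset \subset \mathbf{D}' \subset \mathbf{D}$ such that $X_j$ is entailed by each ontology in
$\{O^*_i \mid \mathcal{D}_i \in \mathbf{D}'\}$ where $O^*_i := (O \setminus \mathcal{D}_i) \cup B \cup \bigcup_{p \in P} p$. Asking a query $X_j$ to a user
means asking them ($O_t \models X_j$?). The set of all queries w.r.t. $\mathbf{D}$ is denoted by $\mathbf{X}_{\mathbf{D}}$.²

Each query $X_j$ partitions the set of diagnoses $\mathbf{D}$ into $\langle \mathbf{D}^P_j, \mathbf{D}^N_j, \mathbf{D}^\emptyset_j \rangle$ such that:

$\mathbf{D}^P_j = \{\mathcal{D}_i \mid O^*_i \models X_j\}$
$\mathbf{D}^N_j = \{\mathcal{D}_i \mid O^*_i \cup X_j \text{ is inconsistent}\}$
$\mathbf{D}^\emptyset_j = \mathbf{D} \setminus (\mathbf{D}^P_j \cup \mathbf{D}^N_j)$

If the answering of queries by a user $u$ is modeled as a function $a_u: \mathbf{X} \to \{yes, no\}$,
then the following holds: If $a_u(X_j) = yes$, then $X_j$ is added to the positive test cases,
i.e. $P \leftarrow P \cup \{X_j\}$, and all diagnoses in $\mathbf{D}^N_j$ are rejected. Given that $a_u(X_j) = no$,
then $N \leftarrow N \cup \{X_j\}$ and all diagnoses in $\mathbf{D}^P_j$ are rejected.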
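The partition induced by a query and the effect of an answer can be sketched as follows. This is a minimal illustration, assuming the same hypothetical forward-chaining toy reasoner for the example ontology (atomic subsumptions, one individual s); the function names `partition` and `apply_answer` are illustrative and not taken from the paper.

```python
# Toy encoding of the example terminology: axiom id -> (subclass, superclass).
AXIOMS = {
    1: ("PhD", "Researcher"),
    2: ("Researcher", "DeptEmployee"),
    3: ("PhDStudent", "Student"),
    4: ("Student", "not DeptMember"),
    5: ("PhDStudent", "PhD"),
    6: ("DeptEmployee", "DeptMember"),
}
BACKGROUND = {"PhDStudent"}  # B = {PhDStudent(s)}


def closure(axiom_ids, facts):
    """Classes derivable for s from the given axioms and asserted facts."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for i in axiom_ids:
            sub, sup = AXIOMS[i]
            if sub in derived and sup not in derived:
                derived.add(sup)
                changed = True
    return derived


def is_consistent(derived):
    return not any(("not " + c) in derived for c in derived)


def partition(query, diagnoses, positives=()):
    """Return (D_P, D_N, D_0) for a query given as a set of class names for s."""
    facts = BACKGROUND.union(*positives)  # O*_i also contains all positive test cases
    d_p, d_n, d_0 = [], [], []
    for diag in diagnoses:
        remaining = set(AXIOMS) - diag
        if query <= closure(remaining, facts):
            d_p.append(diag)          # O*_i entails the query
        elif not is_consistent(closure(remaining, facts | query)):
            d_n.append(diag)          # O*_i together with the query is inconsistent
        else:
            d_0.append(diag)
    return d_p, d_n, d_0


def apply_answer(answer_yes, query, diagnoses, positives, negatives):
    """Record the test case and drop the diagnoses rejected by the answer."""
    d_p, d_n, _ = partition(query, diagnoses, positives)
    if answer_yes:
        positives.append(query)
        rejected = d_n
    else:
        negatives.append(query)
        rejected = d_p
    return [d for d in diagnoses if d not in rejected]


diagnoses = [frozenset({i}) for i in range(1, 7)]
x1 = {"DeptEmployee", "Student"}
print(partition(x1, diagnoses))
```

For the query {DeptEmployee(s), Student(s)} the sketch places the diagnoses [ax_4] and [ax_6] in the entailing part, so a negative answer rejects exactly these two.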

This allows us to formulate the subproblem of ontology debugging addressed in this 
work: 

Definition 4 (Diagnosis Discrimination). Given the set of diagnoses $\mathbf{D} = \{\mathcal{D}_1, \ldots, \mathcal{D}_n\}$
w.r.t. $\langle O, B, P, N \rangle_R$ and a user $u$, find a sequence $(X_1, \ldots, X_q)$ of queries $X_i \in \mathbf{X}$
with minimal $q$, such that $\mathbf{D} = \{\mathcal{D}_t\}$ after assigning each $X_i$ ($i = 1, \ldots, q$) to either $P$ iff
$a_u(X_i) = yes$ or $N$ iff $a_u(X_i) = no$.³

A set of queries for a given set of diagnoses $\mathbf{D}$ can be generated as shown in
Algorithm 1. In each iteration, for a set of diagnoses $\mathbf{D}^P \subset \mathbf{D}$, the generator gets a set of
logical descriptions $X$ that are entailed by each ontology $O^*_i$ where $\mathcal{D}_i \in \mathbf{D}^P$ (function
getEntailments). These descriptions $X$ are then used to classify the remaining
diagnoses in $\mathbf{D} \setminus \mathbf{D}^P$ in order to obtain the partition $\langle \mathbf{D}^P, \mathbf{D}^N, \mathbf{D}^\emptyset \rangle$ associated with $X$.
Then, together with its partition, $X$ is added to the set of queries $\mathbf{X}$. Note that in
real-world applications, investigation of all possible subsets of the set $\mathbf{D}$ might be infeasible.
Thus, it is common to approximate the set of all minimal diagnoses by a set of leading
diagnoses. This set comprises a predefined number $n$ of minimal diagnoses.

The query generation algorithm returns a set of queries $\mathbf{X}$ that generally contains a
lot of elements. Therefore the authors in [15] suggested two query selection strategies.
The split-in-half strategy selects the query $X_j$ which minimizes the following scoring
function:

$sc_{split}(X_j) = \big|\, |\mathbf{D}^P_j| - |\mathbf{D}^N_j| \,\big| + |\mathbf{D}^\emptyset_j|$

2 For the sake of simplicity, we will use $\mathbf{X}$ instead of $\mathbf{X}_{\mathbf{D}}$ throughout this work because the $\mathbf{D}$
associated with $\mathbf{X}$ will be clear from the context.

3 Since the user $u$ is assumed fixed throughout a debugging session and for brevity, we will use
$a_i$ equivalent to $a_u(X_i)$ in the rest of this work.



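Algorithm 1: Query Generation

Input: diagnosis problem instance $\langle O, B, P, N \rangle$, set of diagnoses $\mathbf{D}$
Output: a set of queries and associated partitions $\mathbf{X}$

foreach $\mathbf{D}^P \subset \mathbf{D}$ do
    $X \leftarrow$ getEntailments$(O, B, P, \mathbf{D}^P)$;
    if $X \neq \emptyset$ then
        foreach $\mathcal{D}_r \in \mathbf{D} \setminus \mathbf{D}^P$ do
            if $O^*_r \models X$ then $\mathbf{D}^P \leftarrow \mathbf{D}^P \cup \{\mathcal{D}_r\}$;
            else if $O^*_r \cup X$ is inconsistent then $\mathbf{D}^N \leftarrow \mathbf{D}^N \cup \{\mathcal{D}_r\}$;
            else $\mathbf{D}^\emptyset \leftarrow \mathbf{D}^\emptyset \cup \{\mathcal{D}_r\}$;
        $\mathbf{X} \leftarrow \mathbf{X} \cup \langle X, \mathbf{D}^P, \mathbf{D}^N, \mathbf{D}^\emptyset \rangle$
return $\mathbf{X}$;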



I.e. this strategy prefers queries which eliminate half of the diagnoses independently of 
the query outcome. 
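The split-in-half score is simple enough to state directly in code. The sketch below assumes that each query is already associated with its partition; the query names and two of the three partitions are hypothetical illustration data, not the contents of Table 1.

```python
def sc_split(d_p, d_n, d_0):
    """Split-in-half score: imbalance of the two parts plus undecided diagnoses."""
    return abs(len(d_p) - len(d_n)) + len(d_0)


def best_split_query(partitions):
    """Return the name of the query with minimal split-in-half score."""
    return min(partitions, key=lambda name: sc_split(*partitions[name]))


# Hypothetical partitions over six diagnoses, identified by their index.
partitions = {
    "X_a": ({4, 6}, {1, 2, 3, 5}, set()),     # 2 vs 4 diagnoses
    "X_b": ({2, 4, 6}, {1, 3, 5}, set()),     # 3 vs 3 diagnoses
    "X_c": ({1, 2, 3, 4, 6}, {5}, set()),     # 5 vs 1 diagnoses
}
print(best_split_query(partitions))
```

A query with an even split and no undecided diagnoses reaches the minimal score 0, which is the situation the strategy aims for.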

The entropy-based strategy uses information about prior probabilities $p_t$ for the user to
make a fault when using a syntactical construct of type $t \in CT$, where $CT$ is the set
of construct types available in the used logical description language. E.g., $\forall$, $\exists$, $\sqsubseteq$, $\neg$,
$\sqcup$, $\sqcap$ are some OWL DL construct types. These fault probabilities $p_t$ are assumed to
be independent and used to calculate fault probabilities of axioms $ax_k$ as
$p(ax_k) = 1 - \prod_{t \in CT} (1 - p_t)^{n(t)}$, where $n(t)$ is the number of occurrences of construct type $t$ in
$ax_k$. The probabilities of axioms can in turn be used to determine fault probabilities of
diagnoses $\mathcal{D}_i \in \mathbf{D}$ as

$p(\mathcal{D}_i) = \prod_{ax_r \in \mathcal{D}_i} p(ax_r) \prod_{ax_s \in O \setminus \mathcal{D}_i} (1 - p(ax_s))$    (1)
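The two probability formulas can be sketched in a few lines. The construct names, prior values and axiom probabilities below are hypothetical illustration data; the functions simply evaluate the two products as written above.

```python
from math import prod


def axiom_fault_probability(construct_counts, construct_priors):
    """p(ax) = 1 - prod_t (1 - p_t)^n(t) over the construct types used in ax."""
    return 1.0 - prod(
        (1.0 - construct_priors[t]) ** n for t, n in construct_counts.items()
    )


def diagnosis_fault_probability(diagnosis, axiom_probs):
    """Formula 1: axioms in the diagnosis are faulty, all other axioms are correct."""
    return prod(
        p if ax in diagnosis else 1.0 - p for ax, p in axiom_probs.items()
    )


# Hypothetical priors for two construct types and one axiom using each once.
priors = {"subClassOf": 0.01, "complementOf": 0.05}
p_ax = axiom_fault_probability({"subClassOf": 1, "complementOf": 1}, priors)

# Hypothetical axiom probabilities for a two-axiom ontology.
axiom_probs = {"ax_1": 0.1, "ax_2": 0.2}
print(p_ax, diagnosis_fault_probability({"ax_1"}, axiom_probs))
```

The values returned by `diagnosis_fault_probability` do not sum to one over a set of leading diagnoses; a normalization step over that set would be applied afterwards.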

The strategy is then to select the query which minimizes the expected entropy of the
set of leading diagnoses $\mathbf{D}$ after the query is answered. This means that the expected
uncertainty is minimized and the expected information gain is maximized. According to
[8], this is equivalent to choosing the query $X_j$ which minimizes the following scoring
function:

$sc_{ent}(X_j) = \sum_{a_j \in \{yes, no\}} p(a_j) \log_2 p(a_j) + p(\mathbf{D}^\emptyset_j) + 1$

This function is minimized by queries $X_j$ with $p(\mathbf{D}^P_j) = p(\mathbf{D}^N_j) = 0.5$. So,
entropy-based query selection favors queries whose outcome is most uncertain. After each query


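Algorithm 2: Generic Diagnosis Discrimination

Input: diagnosis problem instance $\langle O, B, P, N \rangle$, set of diagnoses $\mathbf{D}$, set of prior fault probabilities $DP$
Output: target diagnosis $\{\mathcal{D}_t\}$

repeat
    $X \leftarrow$ getBestQuery$(\mathbf{D}, DP)$;
    if getAnswer$(X) = yes$ then $\mathbf{D} \leftarrow \mathbf{D} \setminus \mathbf{D}^N$; $P \leftarrow P \cup \{X\}$;
    else $\mathbf{D} \leftarrow \mathbf{D} \setminus \mathbf{D}^P$; $N \leftarrow N \cup \{X\}$;
until $|\mathbf{D}| = 1$;
return $\mathbf{D}$;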



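$X_j$, the diagnosis probabilities are updated according to the Bayesian formula:

$p(\mathcal{D}_i \mid a_j) = \frac{p(a_j \mid \mathcal{D}_i)\, p(\mathcal{D}_i)}{p(a_j)}$    (2)

where $a_j \in \{yes, no\}$,

$p(a_j = yes) = \sum_{\mathcal{D}_r \in \mathbf{D}^P_j} p(\mathcal{D}_r) + \frac{1}{2} \sum_{\mathcal{D}_k \in \mathbf{D}^\emptyset_j} p(\mathcal{D}_k)$

and $p(a_j \mid \mathcal{D}_k) := 1/2$ for $\mathcal{D}_k \in \mathbf{D}^\emptyset_j$, $p(a_j \mid \mathcal{D}_k) := 0$ if $\mathcal{D}_k$ is rejected by the query answer
$a_j$ and 1 otherwise.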
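The entropy-based score and the Bayesian update can be sketched together. The sketch assumes that the diagnosis probabilities are normalized over the leading diagnoses and that the given answer has non-zero probability; the input values are the rounded priors of the running example, renormalized, so the results are only approximate.

```python
from math import log2


def answer_probability(probs, d_p, d_0):
    """p(a_j = yes): mass of D_P plus half the mass of the undecided part D_0."""
    return sum(probs[d] for d in d_p) + 0.5 * sum(probs[d] for d in d_0)


def sc_ent(probs, d_p, d_n, d_0):
    """Entropy-based score; smaller is better, 0 for a perfectly uncertain query."""
    p_yes = answer_probability(probs, d_p, d_0)
    outcomes = (p_yes, 1.0 - p_yes)
    return sum(p * log2(p) for p in outcomes if p > 0.0) + sum(probs[d] for d in d_0) + 1.0


def bayes_update(probs, d_p, d_n, d_0, answer_yes):
    """Formula 2: posterior diagnosis probabilities after a query answer."""
    p_yes = answer_probability(probs, d_p, d_0)
    p_answer = p_yes if answer_yes else 1.0 - p_yes  # assumed to be non-zero
    rejected = d_n if answer_yes else d_p
    posterior = {}
    for d, p in probs.items():
        likelihood = 0.5 if d in d_0 else (0.0 if d in rejected else 1.0)
        posterior[d] = likelihood * p / p_answer
    return posterior


# Rounded priors from the running example, renormalized to sum to one.
raw = {1: 0.003, 2: 0.003, 3: 0.003, 4: 0.003, 5: 0.393, 6: 0.591}
total = sum(raw.values())
priors = {d: p / total for d, p in raw.items()}

# Partition of the query {DeptEmployee(s), Student(s)}: a "no" rejects D_4 and D_6.
d_p, d_n, d_0 = {4, 6}, {1, 2, 3, 5}, set()
score = sc_ent(priors, d_p, d_n, d_0)
posterior = bayes_update(priors, d_p, d_n, d_0, answer_yes=False)
print(round(score, 3), {d: round(p, 3) for d, p in posterior.items()})
```

After the negative answer almost all probability mass moves to the fifth diagnosis, which illustrates how misleading priors can keep steering the entropy-based strategy towards high-risk queries.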

A generic diagnosis discrimination algorithm (see Algorithm 2) can use either of the
strategies to identify the target diagnosis $\mathcal{D}_t$. The selection strategy implemented in the
getBestQuery function determines the sequence of queries. The result of the evaluation
in [15] shows that entropy-based query selection reveals better performance than
split-in-half in most of the cases. However, split-in-half proved to be the best strategy
in situations when only vague priors are provided, i.e. the target diagnosis $\mathcal{D}_t$ has rather
low prior fault probability. Therefore selection of prior fault probabilities is crucial for
successful query selection and minimization of user interaction.
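The generic discrimination loop can be sketched with the selection strategy passed in as a function. This is a simplified illustration: the query partitions are a fixed, hypothetical table with empty undecided parts, and they are merely restricted to the surviving diagnoses in each round, whereas a real implementation would regenerate queries with a reasoner.

```python
def discriminate(diagnoses, partitions, select_query, get_answer):
    """Ask queries until one diagnosis is left or no query discriminates any more."""
    diagnoses = set(diagnoses)
    asked = []
    while len(diagnoses) > 1:
        # Keep only queries that still split the surviving diagnoses.
        candidates = {}
        for name, (d_p, d_n) in partitions.items():
            p, n = d_p & diagnoses, d_n & diagnoses
            if p and n:
                candidates[name] = (p, n)
        if not candidates:
            break
        query = select_query(candidates)
        d_p, d_n = candidates[query]
        # A positive answer rejects D_N, a negative answer rejects D_P.
        diagnoses -= d_n if get_answer(query) else d_p
        asked.append(query)
    return diagnoses, asked


def split_in_half(candidates):
    """Selection strategy: most balanced partition first."""
    return min(candidates, key=lambda q: abs(len(candidates[q][0]) - len(candidates[q][1])))


# Hypothetical query table over six diagnoses (identified by index).
partitions = {
    "X_a": ({4, 6}, {1, 2, 3, 5}),
    "X_b": ({2, 4, 6}, {1, 3, 5}),
    "X_c": ({1, 2, 3, 4, 6}, {5}),
    "X_d": ({1, 2, 4, 5, 6}, {3}),
    "X_e": ({3, 4}, {1, 2, 5, 6}),
}
target = 2  # simulated user: answers "yes" iff the target diagnosis is in D_P


def simulated_user(query):
    return target in partitions[query][0]


result, asked = discriminate(range(1, 7), partitions, split_in_half, simulated_user)
print(result, asked)
```

Swapping `split_in_half` for a probability-based selector changes only the order of queries, which is exactly the degree of freedom the two strategies differ in.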

Example (continued): To illustrate this, let a user who wants to debug our example
ontology $O$ set $p(ax_i) := 0.001$ for $ax_i$ ($i = 1, \ldots, 4$) and $p(ax_5) := 0.1$, $p(ax_6) := 0.15$,
e.g. because the user doubts the correctness of $ax_5$, $ax_6$ while being quite sure that
$ax_i$ ($i = 1, \ldots, 4$) are correct. Assume that $\mathcal{D}_2$ corresponds to the target diagnosis $\mathcal{D}_t$, i.e. the
settings provided by the user are inept. Application of entropy-based query selection
starts with computation of prior fault probabilities of diagnoses $p(\mathcal{D}_1) = p(\mathcal{D}_2) = p(\mathcal{D}_3) = p(\mathcal{D}_4) = 0.003$,
$p(\mathcal{D}_5) = 0.393$, $p(\mathcal{D}_6) = 0.591$ (Formula 1). Then $X_1$, i.e.
($O_t \models \{\mathit{DeptEmployee}(s), \mathit{Student}(s)\}$?), will be identified as the optimal query
since it has the minimal score $sc_{ent}(X_1) = 0.02$ (see Table 1). However, since the
unfavorable answer $a_1 = no$ is given, this query eliminates only two diagnoses $\mathcal{D}_4$
and $\mathcal{D}_6$ (worst case elimination rate $e_{wc}(X_1) = \frac{1}{3}$). The probability update given by
Formula 2 then yields $p(\mathcal{D}_1) = p(\mathcal{D}_2) = p(\mathcal{D}_3) = 0.01$ and $p(\mathcal{D}_5) = 0.97$. As the next
query, $X_2$ with $sc_{ent}(X_2) = 0.811$ is selected and answered unfavorably ($a_2 = yes$) as
well, which results in the elimination of only one single diagnosis $\mathcal{D}_5$.
Since the worst case elimination rate $e_{wc}(X_2)$ is minimal, we call $X_2$ a high-risk query.
By querying $X_3$ ($sc_{ent}(X_3) = 0.082$, $a_3 = yes$) and $X_4$ ($sc_{ent}(X_4) = 0$, $a_4 = yes$),
the further execution of this procedure finally leads to the target diagnosis $\mathcal{D}_2$. So, by
applying $sc_{ent}$, four queries are required in order to find $\mathcal{D}_t$. If queries are selected by
$sc_{split}$, on the contrary, only three queries are required. The algorithm can select one
of two queries, among them $X_5$, each of which eliminates half of all diagnoses in any case.
We call such a query a no-risk query. Let the strategy select $X_5$, which is answered
positively ($a_5 = yes$). As successive queries, one further query (answered negatively) and
$X_1$ ($a_1 = no$) are selected, which leads to the revelation of $\mathcal{D}_t = \mathcal{D}_2$.

This example demonstrates that the no-risk strategy $sc_{split}$ (three queries) is more
suitable than $sc_{ent}$ (four queries) for fault probabilities which disfavor the target
diagnosis. Let us suppose, on the other hand, that probabilities are assigned more
reasonably in our example, e.g. $\mathcal{D}_t = \mathcal{D}_6$. Then it will take the entropy-based strategy
only two queries ($X_1$, $X_6$) to find $\mathcal{D}_t$ while split-in-half will still require three queries,
e.g. ($X_5$, $X_1$, $X_6$). The complexity of $sc_{ent}$ in terms of required queries varies between
$O(1)$ in the best and $O(|\mathbf{D}|)$ in the worst case depending on the appropriateness of the
fault probabilities. In contrast, $sc_{split}$ always requires $O(\log_2 |\mathbf{D}|)$ queries.

We learn from this example that the best choice of discrimination strategy depends 
on the quality of the meta information in terms of prior fault probabilities. In cases 
where adequate meta information is not available and hard to estimate, e.g. ontol- 
ogy alignment and ontology learning, the inappropriate choice of strategy might cause 
tremendous extra effort for the user interacting with the debugging system. Therefore,
we suggest exploiting additional information gathered by querying the oracle in order to
estimate the quality of given meta information. The new strategy we present incorporates
the elimination rate achieved by the current query when choosing the successive
query. To this end, a parameter of maximum allowed query-risk is continuously adapted.
Our method combines the advantages of both the entropy-based approach and the
split-in-half approach. On the one hand, it exploits the given prior fault probabilities if they
are of high quality. On the other hand, it quickly loses trust in the priors and gets more 
cautious if some evidence is given that the probabilities are misleading. 
Table 1. Nine queries computed with respect to entailed assertional axioms for diagnoses $\mathcal{D}_i \in \mathbf{D}$
of the sample ontology O. Given that no diagnoses have been eliminated yet, X 2 , X4, X$ are 
high-risk queries, X5, Xg are no-risk queries. 



3 Risk Optimization Strategy for Query Selection 

The proposed Risk Optimization Algorithm (RIO) extends the entropy-based query selection
strategy with a dynamic learning procedure that learns by reinforcement how to
select optimal queries. Moreover, it continually improves the prior fault probabilities
based on new knowledge obtained through queries to a user. The behavior of our
algorithm can be co-determined by the user. The algorithm takes into account the user's
doubt about the priors in terms of the initial cautiousness $c$ as well as the cautiousness
interval $[\underline{c}, \overline{c}]$ where $\underline{c}, c, \overline{c} \in [c_{min}, c_{max}] := [0, \lfloor |\mathbf{D}|/2 \rfloor / |\mathbf{D}|]$, $\underline{c} \le c \le \overline{c}$, and $\mathbf{D}$
contains at most $n$ leading diagnoses (see Section 2). The interval $[\underline{c}, \overline{c}]$ constitutes the set
of all admissible cautiousness values the algorithm may take during the debugging
session. High trust in the prior fault probabilities is reflected by specifying a low minimum
required cautiousness $\underline{c}$ and/or a low maximum admissible cautiousness $\overline{c}$. If the user is
unsure about the rationality of the priors, this can be expressed by setting $\underline{c}$ and/or $\overline{c}$ to
a higher value. Intuitively, $\underline{c} - c_{min}$ and $c_{max} - \overline{c}$ represent the minimal desired
difference in performance to a high-risk (entropy) and no-risk (split-in-half) query selection,
respectively.

The relationship between cautiousness c and queries is formalized by the following 
definitions: 

Definition 5 (Cautiousness of a Query). We define the cautiousness $caut(X_i)$ of a
query $X_i$ as follows:

$caut(X_i) := \frac{\min(|\mathbf{D}^P_i|, |\mathbf{D}^N_i|)}{|\mathbf{D}|}$

A query $X_i$ is called braver than query $X_j$ iff $caut(X_i) < caut(X_j)$. Otherwise $X_i$
is called more cautious than $X_j$. A query with highest possible cautiousness is called
no-risk query.

Definition 6 (Elimination Rate). Given a query $X_i$ and the corresponding answer
$a_i \in \{yes, no\}$, the elimination rate $e(X_i, a_i)$ is defined as follows:

$e(X_i, a_i) = \begin{cases} |\mathbf{D}^N_i| / |\mathbf{D}| & \text{if } a_i = yes \\ |\mathbf{D}^P_i| / |\mathbf{D}| & \text{if } a_i = no \end{cases}$

The answer $a_i$ to a query $X_i$ is called favorable iff it maximizes the elimination rate
$e(X_i, a_i)$. Otherwise $a_i$ is called unfavorable. The minimal or worst case elimination
rate $\min_{a_i \in \{yes, no\}} e(X_i, a_i)$ of $X_i$ is denoted by $e_{wc}(X_i)$.

So, the cautiousness $caut(X_i)$ of a query $X_i$ is exactly the minimal, i.e. worst case,
elimination rate, i.e. $caut(X_i) = e_{wc}(X_i) = e(X_i, a_i)$ given that $a_i$ is the unfavorable
query result. Intuitively, the user-defined cautiousness $c$ is the minimum proportion of
diagnoses in $\mathbf{D}$ which should be eliminated by the successive query. For braver queries
the interval between minimum and maximum elimination rate is larger than for more
cautious queries. For no-risk queries it is minimal.

Definition 7 (High-Risk Query). Given a query $X_i$ and cautiousness $c$, then $X_i$ is
called a high-risk query iff $caut(X_i) < c$, i.e. the cautiousness of the query is lower
than the algorithm's current cautiousness value $c$. Otherwise, $X_i$ is called non-high-risk
query. By $HR_c(\mathbf{X}) \subseteq \mathbf{X}$ we denote the set of all high-risk queries w.r.t. $c$. For
given cautiousness $c$, the set of all queries $\mathbf{X}$ can be partitioned in high-risk queries
and non-high-risk queries.

Example (continued): Reconsider the example given in Section 2. Let the user specify
$c = 0.3$ for the set $\mathbf{D}$ including $n = 6$ diagnoses. Given these settings, $X_1$ is a
non-high-risk query since its cautiousness $caut(X_1) = 2/6 > 0.3 = c$. The query $X_2$ is a
high-risk query because $caut(X_2) = 1/6 < 0.3 = c$ and $X_5$ is a no-risk query due to
$caut(X_5) = 3/6 = \lfloor |\mathbf{D}|/2 \rfloor / |\mathbf{D}|$.
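The notions of elimination rate, cautiousness and high-risk query can be sketched as follows. The sketch assumes that the cautiousness is the smaller of the two possible elimination rates and that the undecided part of each partition is empty; the concrete members of the 3/3 partition are an assumption made for illustration.

```python
from fractions import Fraction


def elimination_rate(d_p, d_n, n_diagnoses, answer_yes):
    """Share of diagnoses rejected: D_N on a positive, D_P on a negative answer."""
    rejected = d_n if answer_yes else d_p
    return Fraction(len(rejected), n_diagnoses)


def cautiousness(d_p, d_n, n_diagnoses):
    """Worst case elimination rate over both possible answers."""
    return min(
        elimination_rate(d_p, d_n, n_diagnoses, True),
        elimination_rate(d_p, d_n, n_diagnoses, False),
    )


def is_high_risk(d_p, d_n, n_diagnoses, c):
    """A query is high-risk iff its cautiousness is below the current value c."""
    return cautiousness(d_p, d_n, n_diagnoses) < c


c = Fraction(3, 10)
x1 = ({4, 6}, {1, 2, 3, 5})        # negative answer rejects D_4 and D_6
x2 = ({1, 2, 3, 4, 6}, {5})        # positive answer rejects only D_5
x5 = ({2, 4, 6}, {1, 3, 5})        # assumed members of an even 3/3 split
for name, (d_p, d_n) in {"X_1": x1, "X_2": x2, "X_5": x5}.items():
    print(name, cautiousness(d_p, d_n, 6), is_high_risk(d_p, d_n, 6, c))
```

Exact fractions are used so that comparisons such as 2/6 against 0.3 are not affected by floating-point rounding.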



Given a user's answer $a_s$ to a query $X_s$, the cautiousness $c$ is updated depending
on the elimination rate $e(X_s, a_s)$ by $c \leftarrow c + c_{adj}$ where $c_{adj}$ denotes the cautiousness
adjustment factor which is defined as follows:

$c_{adj} := 2 (\overline{c} - \underline{c})\, adj$    (3)

The factor $2 (\overline{c} - \underline{c})$ in Formula 3 is a scaling factor that simply regulates the extent of
the cautiousness adjustment depending on the interval length $\overline{c} - \underline{c}$. The more crucial
factor in the formula is $adj$ which indicates the sign and magnitude of the cautiousness
adjustment:

$adj := \frac{\lfloor |\mathbf{D}|/2 - \varepsilon \rfloor}{|\mathbf{D}|} - e(X_s, a_s)$

where $\varepsilon \in (0, \frac{1}{2})$ is a constant which prevents the algorithm from getting stuck in a
no-risk strategy for even $|\mathbf{D}|$. E.g., given $c = 0.5$ and $\varepsilon = 0$, the elimination rate of a
no-risk query is $e(X_s, a_s) = \frac{1}{2}$, resulting always in $adj = 0$. The value of $\varepsilon$ can be set to
an arbitrary real number from this interval, e.g. $\varepsilon := \frac{1}{4}$. If $c + c_{adj}$ is outside the user-defined
cautiousness interval $[\underline{c}, \overline{c}]$, it is set to $\underline{c}$ if $c < \underline{c}$ and to $\overline{c}$ if $c > \overline{c}$. Positive $c_{adj}$ is a
penalty telling the algorithm to get more cautious, whereas negative $c_{adj}$ is a bonus resulting in a braver
behavior of the algorithm.

Example (continued): Assume that an expert is quite unsure about the location of the
fault and thus sets $c = 0.4$, $\underline{c} = 0$ and $\overline{c} = 0.5$. In this case the algorithm selects a
no-risk query $X_5$ just as the split-in-half strategy. Given $a_5 = yes$ and $|\mathbf{D}| = 6$, the
algorithm computes the elimination rate $e(X_5, yes) = 0.5$ and adjusts the cautiousness
by $c_{adj} = -0.17$ which yields $c = 0.23$. This allows RIO to select a higher-risk query
in the next iteration. The algorithm finds the target diagnosis $\mathcal{D}_t = \mathcal{D}_2$ by asking three
queries.
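The cautiousness update of this example can be reproduced with a short sketch. It assumes the adjustment is the floor of (|D|/2 minus a small constant) divided by |D|, minus the achieved elimination rate, scaled by twice the interval length; the function name, parameter names and the default constant 0.25 are illustrative choices.

```python
from math import floor


def update_cautiousness(c, c_low, c_high, n_diagnoses, elim_rate, eps=0.25):
    """Return the new cautiousness after a query with the given elimination rate."""
    # Negative adj (good elimination) is a bonus, positive adj a penalty.
    adj = floor(n_diagnoses / 2 - eps) / n_diagnoses - elim_rate
    c_adj = 2 * (c_high - c_low) * adj
    # Keep the result inside the admissible interval [c_low, c_high].
    return min(c_high, max(c_low, c + c_adj))


# Settings of the example: c = 0.4, interval [0, 0.5], six diagnoses,
# and a no-risk query that eliminated half of them.
new_c = update_cautiousness(0.4, 0.0, 0.5, 6, 0.5)
print(round(new_c, 2))
```

With an unfavorable outcome, for instance an elimination rate of 1/6, the same call would push the value up to the upper bound of the interval, so the next query would have to be a no-risk one.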

The RIO algorithm, described in Algorithm 3, starts with the computation of minimal
diagnoses. The getDiagnoses function implements a combination of hitting-set (HS-Tree)
[13] and QuickXPlain [6] algorithms as suggested in [15]. Using uniform cost
search, the algorithm extends the set of leading diagnoses $\mathbf{D}$ with a maximum number
of most probable minimal diagnoses such that $|\mathbf{D}| \le n$.

Then the getProbabilities function calculates the fault probabilities $p(\mathcal{D}_i)$ for
each diagnosis $\mathcal{D}_i$ of the set of leading diagnoses $\mathbf{D}$ using Formula 1. In order to take



Algorithm 3: Risk Optimization Algorithm (RIO)

Input: diagnosis problem instance $\langle O, B, P, N \rangle$, fault probabilities of diagnoses $DP$, cautiousness
    $C = (c, \underline{c}, \overline{c})$, number of leading diagnoses $n$ to be considered, acceptance threshold $\sigma$
Output: a diagnosis $\mathcal{D}$

$P \leftarrow \emptyset$; $N \leftarrow \emptyset$; $\mathbf{D} \leftarrow \emptyset$;
repeat
    $\mathbf{D} \leftarrow$ getDiagnoses$(\mathbf{D}, n, O, B, P, N)$;
    $DP \leftarrow$ getProbabilities$(DP, \mathbf{D}, P, N)$;
    $\mathbf{X} \leftarrow$ generateQueries$(O, B, P, \mathbf{D})$;
    $X_s \leftarrow$ getMinScoreQuery$(DP, \mathbf{X})$;
    if getQueryCautiousness$(X_s, \mathbf{D}) < c$ then $X_s \leftarrow$ getAlternativeQuery$(c, \mathbf{X}, DP, \mathbf{D})$;
    if getAnswer$(X_s) = yes$ then $P \leftarrow P \cup \{X_s\}$;
    else $N \leftarrow N \cup \{X_s\}$;
    $c \leftarrow$ updateCautiousness$(\mathbf{D}, P, N, X_s, c, \underline{c}, \overline{c})$;
until (aboveThreshold$(DP, \sigma)$ $\lor$ eliminationRate$(X_s) = 0$);
return mostProbableDiag$(\mathbf{D}, DP)$;



into account all information gathered by querying an oracle so far, the algorithm adjusts
fault probabilities $p(\mathcal{D}_i)$ as follows: $p_{adj}(\mathcal{D}_i) = (1/2)^z\, p(\mathcal{D}_i)$, where $z$ is the number
of precedent queries $X_k$ for which $\mathcal{D}_i \in \mathbf{D}^\emptyset_k$. Afterwards the probabilities $p_{adj}(\mathcal{D}_i)$
are normalized. Note that $z$ can be computed from $P$ and $N$ which comprise all query
answers. This way of updating probabilities is exactly in compliance with the Bayesian
theorem given by Formula 2. Based on the set of leading diagnoses $\mathbf{D}$, generateQueries
generates all queries according to Algorithm 1. getMinScoreQuery determines
the best query $X_{sc} \in \mathbf{X}$ according to $sc_{ent}$. That is:

$X_{sc} = \arg\min_{X_k \in \mathbf{X}} sc_{ent}(X_k)$

If $X_{sc}$ is a non-high-risk query, i.e. $c \le caut(X_{sc})$ (determined by getQueryCautiousness),
$X_{sc}$ is selected. In this case, $X_{sc}$ is the query with maximum information
gain among all queries $\mathbf{X}$ and additionally guarantees the required elimination rate
specified by $c$.

Otherwise, getAlternativeQuery selects the query $X_{alt} \in \mathbf{X}$ ($X_{alt} \neq X_{sc}$)
which has minimal score $sc_{ent}$ among all least cautious non-high-risk queries $L_c$. I.e.:

$X_{alt} = \arg\min_{X_k \in L_c} sc_{ent}(X_k)$

where $L_c = \{X_r \in \mathbf{X} \setminus HR_c(\mathbf{X}) \mid \forall X_t \in \mathbf{X} \setminus HR_c(\mathbf{X}): caut(X_r) \le caut(X_t)\}$. If
there is no such query $X_{alt} \in \mathbf{X}$, then $X_{sc}$ is selected.
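The two-stage query selection can be sketched as a single function. The sketch assumes that the entropy score and the cautiousness of every query have already been computed; the query names and values are hypothetical illustration data.

```python
from fractions import Fraction


def select_query(queries, c):
    """queries maps a query name to (entropy score, cautiousness)."""
    # Stage 1: the query with minimal entropy-based score.
    x_sc = min(queries, key=lambda q: queries[q][0])
    if queries[x_sc][1] >= c:
        return x_sc  # non-high-risk, so it is taken as is
    # Stage 2: among non-high-risk queries, keep the least cautious ones (L_c)
    # and take the one with the best score.
    non_high_risk = {q: v for q, v in queries.items() if v[1] >= c}
    if not non_high_risk:
        return x_sc  # no alternative exists
    least = min(v[1] for v in non_high_risk.values())
    l_c = [q for q, v in non_high_risk.items() if v[1] == least]
    return min(l_c, key=lambda q: queries[q][0])


# Hypothetical scores and cautiousness values for four queries over six diagnoses.
queries = {
    "X_a": (0.05, Fraction(1, 6)),
    "X_b": (0.30, Fraction(2, 6)),
    "X_c": (0.20, Fraction(2, 6)),
    "X_d": (0.10, Fraction(3, 6)),
}
print(select_query(queries, Fraction(3, 10)))
```

Choosing the least cautious admissible queries first keeps the selection as close as possible to the entropy-based choice while still respecting the required elimination rate.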

Given a positive answer of the oracle, the selected query $X_s \in \{X_{sc}, X_{alt}\}$ is added to the set of positive test cases $P$; otherwise, it is added to the set of negative test cases $N$. In the last step of the main loop the algorithm updates the cautiousness value $c$ (function UPDATECAUTIOUSNESS) as described above.

Before the next query selection iteration starts, a stop condition test is performed. The algorithm evaluates whether the most probable diagnosis is at least $\sigma$% more likely than the second most probable diagnosis (ABOVETHRESHOLD) or whether none of the leading diagnoses has been eliminated by the previous query, i.e. GETELIMINATIONRATE returns zero for $X_s$. If one of the stop conditions is fulfilled, the currently most likely diagnosis is returned (MOSTPROBABLEDIAG).

4 Evaluation 

The main points we want to show in this evaluation are the following. On the one hand, independently of the specified meta information, RIO exhibits superior average behavior compared to the entropy-based method and split-in-half w.r.t. the amount of user interaction required. On the other hand, we want to demonstrate that RIO scales well and that the measured reaction time is well suited for an interactive debugging approach.

As data source for the evaluation we used problematic real-world ontologies produced by ontology matching systems (see footnote 4). This has the following reasons: (1) Matching results often cause inconsistency and/or incoherency of ontologies. (2) The (fault) structure of different ontologies obtained through matching generally varies due to different authors and matching systems involved in the genesis of these ontologies. (3) For the same reasons, it is hard to estimate the quality of fault probabilities, i.e. it is unclear which of the existing query selection strategies to choose for best performance. (4) Available reference mappings can be used as correct solutions of the debugging procedure.

4 Thanks to Christian Meilicke for the supply of the test cases used in the evaluation.

Note that a comparison of RIO with techniques integrated in ontology matching systems such as CODI [12] or LogMap [5] is inappropriate, since all these systems use greedy diagnosis techniques (e.g. [9]), whereas the method presented in this paper is complete.

Matching of two ontologies $O_i$ and $O_j$ is usually understood as the detection of correspondences between elements of these ontologies [16]:

Definition 8 (Ontology matching). Let $Q(O_i)$ and $Q(O_j)$ denote the sets of matchable elements in ontologies $O_i$ and $O_j$. An ontology matching operation determines an alignment $M_{ij}$, which is a set of correspondences between the matched ontologies $O_i$ and $O_j$. Each correspondence is a 4-tuple $(x_i, x_j, r, v)$, such that $x_i \in Q(O_i)$, $x_j \in Q(O_j)$, $r$ is a semantic relation and $v \in [0,1]$ is a confidence value. We call $O_{iMj} := O_i \cup M_{ij} \cup O_j$ the aligned ontology for $O_i$ and $O_j$.

In our approach the elements of $Q(O)$ are restricted to atomic concepts and roles and $r \in \{\sqsubseteq, \sqsupseteq, \equiv\}$ under the natural alignment semantics [9] that maps correspondences one-to-one to axioms of the form $x_i \; r \; x_j$.
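
To make Definition 8 concrete, the following sketch represents a correspondence as a 4-tuple and builds the aligned ontology as a plain set union. It is an illustration only: axioms are modeled as strings, and the class name `Correspondence`, the relation labels and the helper functions are hypothetical choices, not part of any ontology library.

```python
from typing import NamedTuple, Set

# Hypothetical textual labels for the three admissible semantic relations
RELATIONS = {"SubClassOf", "SuperClassOf", "EquivalentTo"}


class Correspondence(NamedTuple):
    x_i: str         # matchable element of O_i (atomic concept or role)
    x_j: str         # matchable element of O_j
    relation: str    # semantic relation r
    confidence: float  # confidence value v in [0, 1]


def to_axiom(c: Correspondence) -> str:
    """Map a correspondence one-to-one to an axiom of the form 'x_i r x_j'."""
    if c.relation not in RELATIONS:
        raise ValueError(f"unsupported relation: {c.relation}")
    if not 0.0 <= c.confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return f"{c.x_i} {c.relation} {c.x_j}"


def aligned_ontology(o_i: Set[str], o_j: Set[str], alignment: Set[Correspondence]) -> Set[str]:
    """Build the aligned ontology as the union of O_i, the alignment axioms and O_j."""
    return o_i | {to_axiom(c) for c in alignment} | o_j
```

The confidence value is not part of the resulting axiom; it is kept on the correspondence so that it can later be used for estimating fault probabilities.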

Example (continued): Imagine that our example ontology $O$ evolved from matching two standalone ontologies $O_1 := \{ax_1, ax_2\}$ and $O_2 := \{ax_3, ax_4\}$, resulting in the alignment $M_{12} = \{ax_5, ax_6\}$. As a concrete use case, assume two departments of a university, each developing an ontology for its homepage, where $O_1$ is an excerpt of the first ontology and $O_2$ an excerpt of the second. In order to unite the homepages and the underlying ontologies, an ontology matching system could be consulted. However, if, as in this case, an alignment $M_{12}$ is generated which yields an inconsistent aligned ontology $O_{1M2}$, the output of the matching system is useless as-is, and combining the homepages is impossible without appropriate ontology debugging support. If we recall that the set of diagnoses for $O$ consists of all single axioms in $O$, we realize that the fault we are trying to find may be located in $O_1$, in $O_2$ or in $M_{12}$. Existing approaches to alignment debugging usually consider only the produced alignment as the source of the problem. Our approach, on the contrary, is designed to cope with the most general setting: any subset $S \subseteq O_{1M2}$ of axioms of the aligned ontology can be analyzed for faults, whereas $O_{1M2} \setminus S$ can be added to the background axioms $\mathcal{B}$ if known to be correct. In this way, the search space for diagnoses can be restricted elegantly depending on the prior knowledge about $\mathcal{D}_t$, which can greatly reduce the complexity of the underlying diagnosis problem.
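
The restriction of the search space in the example can be illustrated with simple set operations. The sketch below is hypothetical: axioms are represented by their names only, and the function name `restrict_search_space` is an assumption made for illustration.

```python
from typing import Set, Tuple


def restrict_search_space(aligned: Set[str], analyzed: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split the aligned ontology into analyzed axioms S and background axioms.

    Axioms outside S are assumed to be known to be correct and are
    therefore moved to the background knowledge.
    """
    if not analyzed <= aligned:
        raise ValueError("analyzed axioms must be a subset of the aligned ontology")
    return set(analyzed), aligned - analyzed


# Example: only the alignment of the running example is analyzed for faults
o_1 = {"ax1", "ax2"}
o_2 = {"ax3", "ax4"}
m_12 = {"ax5", "ax6"}
s, background = restrict_search_space(o_1 | o_2 | m_12, m_12)
```

Here `s` contains the two alignment axioms and `background` the four axioms of the input ontologies, so only two instead of six single-axiom diagnoses remain candidates.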

In [17] it was shown that existing debugging approaches suffer from serious problems w.r.t. both scalability and correctness of results when tested on a dataset of incoherent aligned OWL ontologies. Since RIO is an interactive ontology debugging approach able to query and incorporate additional information into its computations, it can cope with cases unsolved in [17]. In order to provide evidence for this and to show the feasibility of RIO, in addition to the main goals of this evaluation, we decided to use a superset of the dataset (see footnote 5) used in [17] for our tests. Each incoherent aligned ontology $O_{iMj}$ in the dataset is the result of applying one of the ontology matching systems COMA++, Falcon-AO, HMatch or OWL-CTXmatch to a set of six ontologies Ont = {CRS, PCS, CMT, CONFTOOL, SIGKDD, EKAW} in the domain of conference organization. For a given pair of ontologies $O_i \ne O_j \in Ont$, each system produced an alignment $M_{ij}$. On the basis of a manually produced reference alignment $\mathcal{R}_{ij} \subseteq M_{ij}$ for ontologies $O_i$, $O_j$ (cf. [10]), we were able to fix a target diagnosis $\mathcal{D}_t$ for each incoherent $O_{iMj}$. In cases where $\mathcal{R}_{ij}$ suggested a non-minimal diagnosis, we defined $\mathcal{D}_t$ as the minimum cardinality diagnosis which was a subset of $M_{ij} \setminus \mathcal{R}_{ij}$. In one single case, $\mathcal{R}_{ij}$ proved to be incoherent because an obviously valid correspondence $Reviewer_1 \equiv reviewer_2$ turned out to be incorrect. We re-evaluated this ontology and specified a coherent $\mathcal{R}_{ij}$. Yet this makes evident that, in general, people are not capable of analyzing alignments without adequate tool support.

5 http://code.google.com/p/rmbd/downloads

In our experiments we set the prior fault probabilities as follows: $p(ax_k) := 0.001$ for $ax_k \in O_i \cup O_j$ and $p(ax_m) := 1 - v_m$ for $ax_m \in M_{ij}$, where $v_m$ is the confidence of the correspondence underlying $ax_m$. Note that this choice results in a significant bias towards diagnoses which include axioms from $M_{ij}$. Based on these settings, in the first experiment (EXP-1), we simulated an interactive debugging session employing the split-in-half (SPL), entropy (ENT) and RIO algorithms, respectively, for each ontology $O_{iMj}$. Throughout all experiments, we performed module extraction [3] before each test run, which is a standard preprocessing method for ontology debugging approaches. All tests were executed on a Core-i7 (3930K) 3.2 GHz machine with 32 GB RAM, running Ubuntu Server 11.04 and Java 6. The number $|\mathbf{D}|$ of leading diagnoses was set to 9 and $\sigma := 85\%$. As input parameters for RIO we set $c := 0.25$ and $[\underline{c}, \overline{c}] := [c_{min}, c_{max}] = [0, 4/9]$. For the tests we considered the most general setting, i.e. $\mathcal{D}_t \subseteq O_{iMj}$. So, we did not restrict the search for $\mathcal{D}_t$ to $M_{ij}$ only, simulating the case where the user has no idea whether any of the input ontologies $O_i$, $O_j$, the alignment $M_{ij}$ or a combination thereof is faulty. In each test run we measured the number of required queries until $\mathcal{D}_t$ was identified.

In order to simulate the case where the fault includes at least one axiom $ax \in O_{iMj} \setminus M_{ij}$, we implemented a second test session with altered $\mathcal{D}_t$. In this experiment (EXP-2), we precalculated a maximum of 30 most probable minimal diagnoses, and from these we selected the diagnosis with the highest number of axioms $ax_k \in O_{iMj} \setminus M_{ij}$ as $\mathcal{D}_t$ in order to simulate more unsuitable meta information. All the other settings were left unchanged. The queries generated in the tests were answered by an automatic oracle by means of the target ontology $O_{iMj} \setminus \mathcal{D}_t$. The average metrics for the set of aligned ontologies $O_{iMj}$ per matching system were as follows: $312 \le |O_{iMj}| \le 377$ and $19.1 \le |M_{ij}| \le 28.4$.
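
The assignment of prior fault probabilities used in EXP-1 can be written down directly. The following is a sketch under the assumption that axioms are identified by name and that alignment axioms come with the confidence of their underlying correspondence; the function name and parameter names are hypothetical.

```python
from typing import Dict, Iterable, Mapping


def prior_fault_probabilities(
    ontology_axioms: Iterable[str],
    alignment_confidence: Mapping[str, float],
    ontology_prior: float = 0.001,
) -> Dict[str, float]:
    """Assign p(ax_k) = ontology_prior to ontology axioms and
    p(ax_m) = 1 - v_m to alignment axioms with confidence v_m."""
    priors = {ax: ontology_prior for ax in ontology_axioms}
    for ax, v in alignment_confidence.items():
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"confidence of {ax} must lie in [0, 1]")
        priors[ax] = 1.0 - v
    return priors
```

With a correspondence confidence of 0.8, for example, the alignment axiom receives a prior of 0.2, which is 200 times the prior of an ontology axiom. This illustrates the bias towards diagnoses containing alignment axioms noted above.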

In order to analyze the scalability of RIO, we used the set of ontologies from the ANATOMY track of the Ontology Alignment Evaluation Initiative (OAEI) 2011.5, which comprises two input ontologies $O_1$ (Human, 11545 axioms) and $O_2$ (Mouse, 4838 axioms). The size of the alignments generated by 12 different matching systems was between 1147 and 1461 correspondences. Note that the aligned ontologies output by five matching systems, i.e. CODI, CSA, MaasMtch, MapEVO and Aroma, could not be analyzed in the experiments. This was due to a consistent output produced by CODI and the problem that the reasoner was not able to find a model within acceptable time (2 hours) in the case of CSA, MaasMtch, MapEVO and Aroma. Similar reasoning problems were also reported in [1]. Given the ontologies $O_1$ and $O_2$, the output $M_{12}$ of a matching system, and the correct reference alignment $\mathcal{R}_{12}$, we first fixed $\mathcal{D}_t$ as follows: Both ontologies $O_1$ and $O_2$ as well as the correctly extracted alignments $M_{12} \cap \mathcal{R}_{12}$ were placed in the background knowledge $\mathcal{B}$. The incorrect correspondences $M_{12} \setminus \mathcal{R}_{12}$ were analyzed by the debugger. In this way, we identified a set of diagnoses, where each diagnosis is a subset of $M_{12} \setminus \mathcal{R}_{12}$. From this set of diagnoses, we randomly selected one diagnosis as $\mathcal{D}_t$. Then we started the actual experiments: In EXP-3 (see footnote 7), in order to simulate reasonable prior fault probabilities, a debugging session with parameter settings as in EXP-1 was executed. In EXP-4, we altered the settings in that we specified $p(ax_k) := 0.01$ for $ax_k \in O_i \cup O_j$ and $p(ax_m) := 0.001$ for $ax_m \in M_{ij}$, which caused the target diagnosis, which consisted solely of axioms in $M_{ij}$, to be assigned a relatively low prior fault probability.

Results of both experimental sessions, (EXP-1, EXP-2) and (EXP-3, EXP-4), are summarized in Figure 2(a) and Figure 2(b), respectively. For the ontologies produced by each of the matching systems and for the different experimental scenarios, the figures show the (average) number of queries asked by RIO and the (average) differences to the number of queries needed by the per-session better and worse strategy of SPL and ENT, respectively. The results illustrate clearly that the average performance achieved by RIO was always substantially closer to the better than to the worse strategy. In both EXP-1 and EXP-2, throughout 74% of 27 debugging sessions, RIO worked as efficiently as the best strategy (Figure 1(a)). In more than 25% of the cases in EXP-2, RIO even outperformed both other strategies; in these cases, RIO could save more than 20% of user interaction on average compared to the best other strategy. In one scenario involving OWL-CTXmatch in EXP-1, it took ENT 31 and SPL 13 queries to finish, whereas RIO required only 6 queries, which amounts to an improvement of more than 80% and 53%, respectively. In (EXP-3, EXP-4), the savings achieved by RIO were even more substantial. RIO manifested superior behavior to both other strategies in 29% and 71% of cases, respectively. No less remarkably, in 100% of the tests in EXP-3 and EXP-4, RIO was at least as efficient as the best other strategy. Table 2, which provides the average number of queries per strategy, demonstrates that, overall, RIO is the best choice in all experiments. Consequently, RIO is suitable for both good meta information as in EXP-1 and EXP-3, where $\mathcal{D}_t$ has high probability, and poor meta information as in EXP-2 and EXP-4, where $\mathcal{D}_t$ is a-priori less likely.

Additionally, Table 2 illustrates the (average) overall debugging time, assuming that queries are answered instantaneously, and the reaction time, i.e. the average time between two successive queries. Also w.r.t. these aspects, RIO manifested good performance. Since the times consumed by either of the strategies in (EXP-1, EXP-2) are almost negligible, consider the more meaningful results obtained in (EXP-3, EXP-4). While the best reaction time in both experiments was achieved by SPL, we can clearly see that SPL was significantly inferior to both ENT and RIO concerning the user interaction required and the overall time. RIO achieved the best debugging time in EXP-4, and needed only 2.2% more time than the best strategy (ENT) in EXP-3. However, if we assume that the user is capable of reading and answering a query in, e.g., half a minute on average, which is already quite fast, then the overall time savings of RIO compared to ENT in EXP-3 would already amount to 5%. Doing the same thought experiment for EXP-4, using RIO instead of ENT and SPL would save 25% and 50% of debugging time on average, respectively. All in all, the measured times confirm that RIO is well suited as an interactive debugging approach.
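
The improvement figures quoted for the OWL-CTXmatch scenario, and the thought experiment on answering times, follow from simple arithmetic. The sketch below reproduces the former and states the latter as a formula; the function names are hypothetical, and the default of 30 seconds per answer mirrors the half-minute assumption made in the text.

```python
def relative_saving(q_other: int, q_rio: int) -> float:
    """Fraction of queries saved by RIO relative to another strategy."""
    if q_other <= 0:
        raise ValueError("q_other must be positive")
    return (q_other - q_rio) / q_other


def total_session_time(debug_time_s: float, n_queries: float, answer_time_s: float = 30.0) -> float:
    """Overall session time if the user needs answer_time_s per query."""
    return debug_time_s + n_queries * answer_time_s


saving_vs_ent = relative_saving(31, 6)  # ENT needed 31 queries, RIO 6
saving_vs_spl = relative_saving(13, 6)  # SPL needed 13 queries, RIO 6
```

The two savings evaluate to 25/31 (about 80.6%) and 7/13 (about 53.8%), which matches the "more than 80% and 53%" stated above. Once user answering time is included, the number of queries dominates the overall session time whenever the pure computation time is small in comparison.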



7 For all details w.r.t. (EXP-3, EXP-4), see http://code.google.com/p/rmbd/wiki/OntologyAlignmentAnatomy.



For the SPL and ENT strategies, the difference w.r.t. the number of queries per test run between the better and the worse strategy was substantial, with a maximum of 2300% in EXP-4 and averages of 190% to 1145% throughout all four experiments, measured on the basis of the better strategy (Figure 1(b)). Moreover, the results show that the different quality of the prior fault probabilities in {EXP-1, EXP-3} compared to {EXP-2, EXP-4} clearly affected the performance of the ENT and SPL strategies (see the first two rows in Figure 1(a)). This motivates the application of RIO.
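
The overhead measure underlying these percentages compares the worse of the two strategies with the better one per debugging session. A minimal sketch, with a hypothetical function name, is:

```python
def overhead_percent(q_spl: int, q_ent: int) -> float:
    """Overhead (in %) of the worse strategy relative to the better one."""
    q_better = min(q_spl, q_ent)
    q_worse = max(q_spl, q_ent)
    if q_better <= 0:
        raise ValueError("query counts must be positive")
    return (q_worse - q_better) / q_better * 100
```

For the OWL-CTXmatch scenario mentioned earlier (SPL 13 queries, ENT 31 queries), the overhead is 18/13 · 100, i.e. roughly 138%. An overhead of 2300% means that the worse strategy needed 24 times as many queries as the better one.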

Fig. 1. (a) Percentage rates indicating which strategy performed best/better w.r.t. the required user interaction, i.e. number of queries. EXP-1 and EXP-2 involved 27, EXP-3 and EXP-4 seven debugging sessions each. $q_{str}$ denotes the number of queries needed by strategy $str$ and min is an abbreviation for $\min(q_{SPL}, q_{ENT})$. (b) Box-Whisker Plots presenting the distribution of overhead $(q_w - q_b)/q_b \cdot 100$ (in %) per debugging session of the worse strategy $q_w := \max(q_{SPL}, q_{ENT})$ compared to the better strategy $q_b := \min(q_{SPL}, q_{ENT})$. Mean values are depicted by a cross.

Fig. 2. The bars show the avg. number of queries (q) needed by RIO, grouped by matching tools. The distance from the bar to the lower (upper) end of the whisker indicates the avg. difference of RIO to the queries needed by the per-session better (worse) strategy of SPL and ENT, respectively.

5 Conclusion 

We have shown problems of state-of-the-art interactive ontology debugging strategies 
w.r.t. the usage of unreliable meta information. To tackle this issue, we proposed a learn- 
ing strategy which combines the benefits of existing approaches, i.e. high potential and 
low risk. Depending on the performance of the diagnosis discrimination actions, the 
trust in the a-priori information is adapted. Tested under various conditions, our algo- 
rithm revealed an average performance superior to two common approaches in the field 
w.r.t. required user interaction. In our evaluation we showed the utility of our approach 
in the important area of ontology matching, its scalability and adequate reaction time 
allowing for continuous interactivity. 

Table 2. Average time (ms) for the entire debugging session (debug), average time (ms) between two successive queries (react), and average number of queries (q) required by each strategy.